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Abstract — Cognitive  networking  mechanisms  promise  to 
greatly  improve  network  performance  over  non-cognitive 
mechanisms,  by  making  more  efficient  use  of  bandwidth,  spec¬ 
trum,  and  power.  However,  these  mechanisms  must  be  designed 
with  cyber  security  in  mind  in  order  to  remain  efficient  in  the 
presence  of  subverted,  adversarial  participants.  In  this  paper, 
we  demonstrate  the  susceptibility  of  two  specific  cognitive- 
networking  mechanisms  to  a  single  Byzantine  participant. 
Specifically,  we  describe  a  novel  ’energy  well’  attack  against 
Q-routing,  in  which  a  Byzantine  participant  can  attract  traffic 
meant  for  an  honest  participant.  Secondly,  we  describe  a  denial 
of  service  attack  against  a  no-regret  learning  algorithm  lor 
Dynamic  Spectrum  Access  (DSA),  in  which  a  single  Byzantine 
participant  can  degrade  network-wide  performance  for  an 
arbitrary  amount  of  time.  These  attacks  demonstrate  why 
cyber  security  techniques  must  be  designed  into  cognitive 
mechanisms  before  use  in  the  tactical  field  so  that  they  do  not 
fail  to  tolerate  adversarial  behavior.  We  conclude  by  discussing 
possible  mitigation  concepts  and  future  work. 

I.  Introduction 

Cognitive  networking  mechanisms  promise  to  greatly 
improve  network  performance  (when  compared  to  non- 
cognitive  mechanisms)  by  making  more  efficient  use  of 
bandwidth,  spectrum,  power,  and  so  on.  They  also  promise 
to  greatly  ease  the  task  of  network  management  by  making 
networks  self-aware  and  self-configuring,  to  some  degree. 
These  advances  will  benefit  both  military  and  civilian  net¬ 
works.  Military  networks,  however,  must  remain  robust  and 
resilient  in  the  face  of  cyberattacks  well  beyond  what  is 
commonly  seen  in  the  civilian  sphere.  Furthermore,  these  at¬ 
tacks  may  be  of  a  very  sophisticated  nature,  or  planned  well 
in  advance.  Such  attacks  may  attempt  to  subvert  network 
equipment  at  manufacture  or  in  the  supply  chain  (a  lifecycle 
attack),  they  may  attempt  to  implant  malicious/adversarial 
code  on  network  equipment  through  so-called  ‘buffer  over¬ 
flows’  (an  injection  attack),  and  they  may  attempt  to  subvert 
a  legitimate  operator  of  network  equipment  (the  insider 
attack). 

In  all  of  these  cases,  a  cognitive  network  may  include  le¬ 
gitimate  participants  who  are  actually  acting  in  an  adversar- 

This  work  is  sponsored  by  the  United  States  Army  under  Air  Force 
Contract  FA8721-05-C-0002.  Opinions,  interpretations,  conclusions  and 
recommendations  ate  those  of  the  authors  and  are  not  necessarily  endorsed 
by  the  United  States  Government. 


ial  manner.  Although  cognitive-networking  mechanisms  may 
achiere  high  and  improved  performance  in  non-adversarial 
settings,  this  does  not  guarantee  that  they  will  remain  robust 
and  high-performing  in  adversarial  ones.  As  \vc  show  in 
this  paper,  in  fact,  cognitive-networking  mechanisms  may  be 
quite  vulnerable  to  even  a  single  mah'cious  insider.  As  part 
of  a  recent  collaboration  between  MiT  Lincoln  Laboratory 
and  the  U.S.  Army  Communicatiuns-Electronics  Research, 
Development,  and  Engineering  Center  (CERDEC)  into  the 
security  of  cognitive-networking  mechanisms,  v/e  analyzed 
the  security  of  tw  o  specific  algorithms: 

•  The  ‘Q-routing’  algorithm  for  packet-routing  in  mobile, 
ad-hoc  networks  [1],  and 

•  The  ‘no  regret’  algorithm  for  dynamic  spectrum  anal¬ 
ysis  [3]. 

In  both  cases,  we  found  that  the  performance  of  the  algo¬ 
rithm  can  be  significantly  degraded  for  a  large  portion  of  the 
network  by  a  single  mis-behaving  adversary.  In  this  work, 
we  describe  these  two  algorithms  and  the  single-participants 
attacks  we  found  (Sections  II  and  III).  We  finish  with  some 
observations  and  conclusions  (Section  IV). 

II.  Q-routing 

A.  Background 

At  its  core,  the  problem  of  packet  routing  in  a  mobile, 
ad-hoc  network  (MANET)  routing  is  a  sequence  of  simple 
local  decisions.  At  each  point  in  a  packet’s  path  from  source 
to  destination,  an  intermediate  node  comes  into  possession 
of  the  packet  and  makes  a  simple-sounding  decision:  which 
neighbor  should  receive  the  packet  next?  This  can  actually  be 
quite  difficult  to  determine  in  MANETs.  First,  the  network 
topology  of  the  MANET  will  change  over  time — sometimes 
quite  quickly.  What  was  once  the  ‘best’  neighbor  in  the 
past  may  not  be  the  ‘best’  neighbor  now.  Second,  the  ‘best’ 
neighbor  may  not  be  the  topologically  nearest  neighbor. 
Network  congestion  may  saturate  the  capacity  of  links  or 
nodes  in  a  MANET,  and  so  it  may  sometimes  be  ‘best’  to 
send  the  packet  away  from  its  destination  in  an  effort  to 
reach  a  less-saturated  path.  Lastly,  the  changing  topology  of 
a  MANET  means  that  the  ‘best’  neighbor  of  the  past  may 
not  even  be  a  neighbor  now. 
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Figure  1 .  Example  Q-routing  MANET-neighborhood 


In  this  section,  we  consider  one  cognitive  MANET-routing 
protocol:  Q-routing. 

B.  The  Q-routing  algorithm 

The  Q-routing  algorithm  [1]  is  a  machine-learning  al¬ 
gorithm  based  on  reinforcement  learning.  As  opposed  to 
other  MANET-routing  protocols  (e.g.,' Optimized  Link  State 
Routing  (OLSR)  [2])  the  Q-routing  algorithm  does  not  build 
spanning-trees  of  the  MANET.  Instead,  each  node  in  a  Q- 
routing  MANET  requests  (and  trusts)  a  time-required-to- 
deliver  estimate  from  each  of  its  neighbors. 

Specifically,  every  node  n*  maintains  a  table  Qi  with  one 
row  for  each  neighbor  and  one  column  for  each  destination 
in  the  network.  The  cell  Qi(j,d),  for  example,  is  held  in 
table  Qi  (which  is  maintained  by  n*)  and  associated  with 

1)  nfs  neighbor  rij,  and 

2)  destination  d. 

In  that  cell,  n *  will  maintain  a  value  which  estimates  how 
quickly  a  packet  will  reach  destination  d  if 

■  the  packet  is  given  to  n*,  and 

•  node  nx  passes  it  on  to  rij . 

This  estimate  will  be  the  total  of  three  components: 

•  The  time  it  takes  rii  to  process  the  packet  (including 
the  time  it  stays  in  rii’s  queue), 

•  The  time  required  for  to  transmit  the  packet  to  rij , 
and 

■  The  time  it  takes  for  the  packet  to  reach  d  after  being 
received  by  nj . 

We  illustrate  Q-routing  with  a  running  example.  Con¬ 
sider  the  MANET-neighborhood  of  Figure  1.  Node  na 
maintains  the  table  Qa,  which  is  shown  in  part.  We  note 
that  Qa(nb,rib)  is  fairly  small,  as  are  Qa(nc,  nc)  and 
Qa(nd,  rid).  This  makes  sense:  once  a  packet  for  rib  reaches 
na,  the  only  time-consuming  events  remaining  are  (1)  for 
the  packet  to  make  it  through  na’s  queue,  and  (2)  for  the 
packet  to  be  transmitted  to  na.  The  entry  in  Qa{nc,rib)  is 
larger,  as  a  packet  for  rib  will  take  longer  to  reach  na  if 
routed  through  nc  first. 

To  route  a  packet  for  destination  d,  a  node  n*  consults 
its  table  to  find  the  ‘fastest’  neighbor.  That  is,  it  looks  at 


column  d  of  its  table  Qi  and  finds  the  row  with  the  smallest 
entry: 


arg  mi  nQi(n,d) 

n 

Returning  to  our  example,  suppose  that  na  receives  a 
packet  for  destination  ne.  It  then  it  consults  the  column  of 
Qa  for  ne.  Looking  at  that  column,  it  sees  that  the  lowest 
entry  is  Qa{rib,  ne).  Therefore,  it  forwards  the  packet  to  na- 

The  table  Qi  makes  the  ‘routing  problem’  extremely  easy 
to  solve — provided  that  the  table  is  accurate  and  timely.  The 
core  of  Q-routing,  then,  is  the  mechanism  by  which  the 
tables  are  built  and  updated.  The  core  idea  of  this  process  is 
as  follows:  when  a  node  rii  routes  a  packet  to  a  neighbor  rij, 
it  also  queries  that  neighbor  about  the  packet’s  destination. 
The  neighbor  rij  will  provide  a  value  from  its  table,  and 
the  original  node  ,i.  will  use  this  estimate  to  update  its  own 
table. 

Specifically,  suppose  node  m  routes  packets  for  destina¬ 
tion  d  to  its  neighbor  nj.  Each  time  it  does  so,  it  queries  rij 
for  about  its  distance  to  d.  Node  iij  will  look  in  its  table  Qj 
for  the  column  associated  with  d,  and  return  the  lowest  value 
in  that  column.  Call  this  value  v.  When  rii  receives  a  value  v 
from  rij,  it  understands  this  to  represent  the  time  it  will  take 
a  packet  to  arrive  at  d  once  the  packet  arrives  at  rij .  Node  rii 
wishes,  however,  to  estimate  a  slightly  different  value:  the 
time  required  for  a  packet  to  arrive  at  d  once  it  arrives  at  rii, 
given  that  it  is  then  passed  on  to  rij.  To  estimate  this  value, 
it  adds  to  v  latency  of  its  queue,  ,s,  and  the  transmission¬ 
time  to  send  a  packet  to  rij,  t.  Node  nfs  new  estimate  (for 
time  between  (a)  receiving  a  packet  for  d,  and  (b)  the  packet 
arriving  at  d,  given  that  it  routes  the  packet  through  rij)  is 

v  +  s+t 

However,  node  rii  does  not  immediately  change  Qi{j,d) 
to  this  value.  To  do  so  might  make  the  algorithm  overly 
responsive  to  transient  changes  in  the  network.  Instead,  node 
n,  changes  Qi{j,d)  to  be  closer  to  this  new  estimate,  and 
does  so  by  applying  a  discount  to  the  new  information. 

In  particular,  let  the  Q-routing  algorithm  be  parameterized 
by  a  discount  factor  0  <  p,  <  1.  In  this  case,  node  rii  adjusts 
the  value  in  Qfj,  d)  by  the  following  amount: 

A  Qfj,  d)  =  p(v  +  s  +  t-  Qi(j,  d)) 

In  our  running  example,  suppose  that  na  queries  rib  about 
node  ne.  Then  rib  queries  its  internal  table  Qb  about  the 
column  associated  with  ne.  (See  Figure  2.)  The  smallest 
value  in  this  column  is  12,  and  so  rib  returns  d  =  12  to  na. 
Node  na  adds  to  this  the  latency  of  its  own  queue  (s  =  4) 
and  the  transmission  time  for  sending  a  packet  to  rib  (£  =  2) 
and  arrives  at  a  new  estimate  of  18.  The  value  currently  in 
its  table  for  Qa(rib,ne),  however,  is  20. 

Suppose  that  this  instance  of  Q-routing  is  running  with  a 
discount  factor  of  p  =  \ .  Then  na  computes  an  adjustment 
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Figure  2.  Node  74  operation  in  Q-routing 
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Figure  3.  Node  na’s  update  to  table  Qa 


of 


AQa(nb,ne)  /u(v  +  s  + 1  -  Qi(j,d)) 

=  ^  (12  +  4  + 1  —  20) 

=  -1 

It  then  applies  this  adjustment  to  Qa{nb,ne )  and  ends 
(Figure  3)  with  a  new  value  of  Qa(iib,  ne)  —  19. 

C.  Our  attacks 

In  our  security  analysis  of  the  Q-routing  algorithm,  we 
uncovered  two  ‘denial-of-service’  attacks  which  could  be 
launched  by  a  single  subverted  insider: 

•  The  ‘random  walk’  attack:  In  this  attack,  the  adversary 
provides  false  ‘distance’  values  for  a  particular  desti¬ 
nation.  These  values  might  be  random  values,  chosen 
afresh  each  time.  Alternately,  they  can  be  chosen  to 
slowly  increase  or  decrease  at  a  rate  calculated  to 
‘resonate’  with  the  algorithm’s  discount  rate.  In  both 
cases,  the  purpose  of  the  false  and  challenging  values 
is  to  prevent  the  underlying  Q-leaming  algorithm  from 
converging.  In  Figure  4  (taken  from  [1])  one  can  see  the 
performance  of  the  Q-routing  algorithm  over  time,  for 
both  light  and  heavy  network  loads.  (The  performance 
of  shortest-path  routing  is  also  included  for  comparison, 
though  we  do  not  need  it  for  this  discussion.)  As  one 
can  see,  the  Q-routing  algorithm  displays  extremely 
poor  performance  at  first,  while  the  Q-leaming  algo¬ 
rithm  is  still  learning  the  optimal  routes.  During  this 
phase,  packets  are  being  routed  more-or-less  randomly 
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Figure  4.  Performance  of  Q-routing  over  time.  Left  chart  represents  light 
network  ioad,  and  the  right  chart  represents  heavy  network  load.  Height 
of  curve  represents  packet-delivery  time.  (Lotv.r  values  represent  better 
network  performance.)  Diagram  taken  from  [1],  which  included  sliortea- 
palh  routing  as  a  basis  for  comparison. 


(hence  the  name  of  this  attack).  When  the  algorithm 
converges,  on  the  other  hand,  performance  increases 
dramatically.  The  point  of  this  attack,  therefore,  is 
to  keep  the  algorithm  from  converging  as  long  as 
possible  (or  to  return  the  algorithm  to  a  pre-convcrgence 
state).  By  changing  its  reported  ‘distance’  values,  either 
randomly  or  according  to  some  algorithm,  the  adversary 
may  be  able  to  accomplish  this  within  some  sphere  of 
influence  (determined  by  the  discount  rate). 

.  The  ‘energy  well’  attack:  in  this  attack,  the  adversary 
simply  advertises  a  falsely-small  distance  to  a  far- 
distance  destination.  This  false  information  eventually 
propagates  to  the  adversary’s  neighbors,  the  neighbors’ 
neighbors,  and  so  on.  Eventually,  nodes  closer  to  the 
adversary  than  the  destination  will  come  to  believe  that 
the  best  route  to  the  destination  is  toward  the  adversaiy, 
which  could  actually  be  in  the  opposite  direction  from 
the  actual  destination. 

We  note  that  this  second  attack  may  have  long-lasting 
repercussions.  Once  the  adversary  has  been  able  to 
establish  the  ‘energy  well,’  it  has  been  encoded  in  the 
internal  table  of  each  fooled  node.  Even  if  the  adversary 
is  identified  and  removed  from  the  network,  the  energy 
well  will  continue  to  exist.  In  fact,  the  energy  well  will 
continue  to  trap  packets  and  distort  network  traffic  until 
it  is  removed  from  the  internal  table  of  each  affected 
node.  Because  each  node  updates  its  table  based  on 
reports  from  neighbors,  and  because  these  neighbors 
might  also  have  been  fooled  by  the  adversary,  the 
energy  well  will  persist  for  some  time.  (Ultimately,  it  is 
the  neighbors  of  the  adversary  that  will  begin  to  correct 
their  internal  tables  first,  and  the  effect  will  spread  from 
the  center  of  the  well  outward.) 


D.  Our  recommendations 

Although  we  do  not  focus  on  recommendations  or  mit¬ 
igations  in  this  particular  work,  we  note  that  these  attacks 
can  be  prevented  through  network-wide  distribution  (by  each 
participant)  of  local  information.  For  example,  participants 
in  the  Optimized  Link-State  Protocol  (OLSP  [2])  broadcast 
(to  the  entire  network)  information  about  their  immediate 
neighborhood.1  This  information  allows  every  node  to  con¬ 
struct  a  spanning-tree  of  the  entire  network.  In  the  context  of 
a  Q-routing  network,  this  additional  information  would  allow 
participants  to  detect  (and  ameliorate)  the  above  attacks. 

III.  No-regret  learning 

A.  Background 

The  term  ‘dynamic  spectrum  access’  (DSA)  refers  to  the 
notion  that  instead  of  being  limited  to  a  single  frequency, 
a  radio  transmitter  might  wish  to  opportunistically  ‘jump’ 
from  frequency  to  frequency  so  as  to  maximize  their  ability 
to  communicate.  In  this  section,  we  consider  the  setting 
in  which  many  ‘peer’  transmitters  must  share  a  range  of 
frequencies  (broken  into  channels)  over  which  they  share 
equal  claim.2  These  users  can  use  any  of  the  available 
channels,  or  even  multiple  channels  at  once.  These  channels 
have  finite  capacity,  however,  and  can  be  over-saturated  if 
total  usage  exceeds  their  capacity.  Furthermore,  channel¬ 
switching  may  be  an  expensive  operation  which  the  users 
wish  to  avoid  as  much  as  possible. 

To  maximize  their  efficiency,  therefore,  each  user  may 
wish  to  observe  and  predict  tire  behavior  of  the  other  users.  If 
it  can  predict  the  future  behavior  of  these  other  agents  based 
on  past  behavior,  then  it  may  be  able  to  identify  channels 
with  a  high  probability  of  being  ‘free’  for  long  intervals.  It 
can  then  use  that  channel  (or  a  large  fraction  thereof)  for  its 
own  communication  while  minimizing  the  costs  of  channel¬ 
switching. 

B.  The  algorithm 

In  this  section,  we  consider  one  algorithm  for  channel- 
selection  in  this  DSA  setting.  This  algorithm  [5]  draws  on 
an  underlying  result  from  game  theory ,  showing  that  non¬ 
communicating  players  can  achieve  a  correlated  equilibrium 
in  a  repeated  game  [4].  To  apply  this  result  to  dynamic 
spectrum  access,  setting  is  modeled  as  a  game  wherein  the 
participants  are  the  players  and  each  strategy  represents  one 
possible  usage  of  channels.  That  is,  a  strategy  would  have 
the  player  to  use  channel  1  with  rate  r\,  channel  2  with  rate 
r2,  and  so  on,  for  some  ri,  r2,. ..It  is  assumed  that  rates 

1  We  note  that  the  authors  of  the  Q-routing  algorithm  also  consider 
describe  a  variant  algorithm  which  incorporates  this  sort  of  neighborhood¬ 
wide  communication  [1,  Section  3.2],  However,  their  discussion  of  this 
variant  is  motivated  by  performance  concerns  rather  than  security  concerns. 

2The  term  ‘dynamic  spectrum  access’  can  also  be  used  in  a  setting  in 

which  some  frequencies  are  licensed  to  ‘primary’  users  who  use  them  only 
infrequently.  Therefore,  they  are  often  available  to  be  used  opportunistically 
by  ‘secondary’  users.  We  do  not  consider  this  setting  in  this  paper. 


n  are  discretized,  and  so  there  are  only  finitely  many  rates 
from  which  a  player  might  choose.  It  is  further  assumed 
that  there  are  a  finite  number  of  channels,  leading  to  a  finite 
number  of  strategies. 

The  ‘reward’  received  by  each  player  in  the  game  is 
proportional  to  their  total  throughput  over  all  channels, 
which  is  the  sum  of  their  throughput  on  each  channel.  Their 
throughput  on  a  given  channel,  furthermore,  is  proportional 
to  the  rate  they  choose  for  that  channel — unless  the  sum 
of  rates  on  that  channel  exceeds  its  capacity.  In  this  case, 
the  throughput  of  the  channel  drops  to  zero,  and  the  players 
receive  no  utility  from  that  channel  in  the  outcome  of  that 
round.3  It  is  in  each  player’s  interest,  therefore,  to  use  as 
much  as  possible  of  each  available  channel  so  long  as  the 
channels  do  not  become  over-subscribed. 

Lastly,  we  note  that  the  game  is  a  repeated  game,  meaning 
that  secondary  users  play  the  above  game  in  each  round  in 
an  infinite  series  of  rounds.  However,  players  can  adapt  their 
strategies  between  rounds,  and  can  do  so  based  on  previous 
rounds. 

Once  the  DSA  setting  is  represented  as  a  game,  the 
algorithm  of  this  section  (presented  by  [5],  which  simply 
refines  the  more-general  algorithm  of  [4]  for  the  DSA 
setting)  attempts  to  bring  this  game  to  an  equilibrium'. 
an  assignment  of  strategies  (rather,  probability-distributions 
over  all  possible  strategies)  to  players  with  the  following 
property:  no  player  can  independently  (unilaterally)  act  to 
increase  the  overall  utility  of  all  players,  so  long  as  all  other 
players  obey  their  assigned  distribution.  Specifically,  the 
algorithm  attempts  to  find  a  correlated  equilibrium,  which  is 
a  type  of  equilibrium  in  which  the  players  are  given  possibly- 
correlated  signals  as  to  the  strategies  to  select.  In  particular, 
each  player  is  given  their  own  individual  signal,  but  these 
signals  may  be  correlated  with  each  other.  If  the  signals  are 
completely  uncorrelated,  the  resulting  equilibrium  will  be  a 
Nash  equilibrium.  If  the  signals  are  correlated,  however,  the 
resulting  equilibria  may  produce  better  overall  performance 
than  can  be  produced  by  Nash  equilibria.  In  our  setting, 
the  semi-common  signal  is  simply  the  history  of  previous 
rounds,  and  this  is  enough  for  the  algorithm  of  [4]  to  achieve 
correlated  equilibrium — given  that  the  game  is  unchanging 
and  number  of  rounds  is  sufficiently  long. 

To  achieve  correlated  equilibrium  a  completely  distributed 
manner,  each  participant  performs  a  sequence  of  computa¬ 
tions  at  each  round.  Specifically,  it  will  compute  the  regret 
it  currently  associates  with  each  strategy.  The  regret  of  a 
strategy  is  defined  to  be  the  difference  between  two  values: 

•  The  total  utility  the  node  would  have  ‘received’  from 
the  game  over  all  previous  rounds  had  it  followed  its 
current  strategy  each  previous  round,  and 

•  The  total  utility  the  node  would  have  ‘received’  from 

3The  model  also  allows  for  the  possibility  that  some  nodes  are  out  of 
range  from  each  other,  in  which  case  their  actions  are  independent. 


the  game  over  all  previous  rounds  had  it  followed  the 
strategy  in  question  each  previous  round. 

The  first  value  is  subtracted  from  the  second,  and  negative 
values  are  rounded  up  to  zero.  In  both  cases,  the  past  actions 
of  all  other  players  remain  unchanged. 

To  illustrate  this  with  an  example:  suppose  that  a  game 
has  just  finished  its  third  round,  and  that  participant  P 
is  currently  using  strategy  Sq.  To  calculate  the  regret  it 
associates  with  another  strategy  Si,  it  first  computes  two 
values: 

«  First,  the  utility  it  would  have  received  had  it  followed 
strategy  so  for  all  previous  rounds  (which  may  or  may 
not  have  been  the  case).  For  the  sake  of  completeness, 
let  us  suppose  that  it  would  have  received  utility  1  in 
round  1,  utility  3  in  round  2,  and  utility  3  in  round  3.4 
Strategy  so  then  would  have  generated  total  utility  7. 

.  Then,  the  participant  computes  the  total  utility  that 
would  have  been  generated  by  strategy  Si.  With  this 
strategy,  tire  paiticipant  would  have  generated  utility  1 
in  round  1,  utility  1  in  round  2,  and  utility  2  in  round 

3.  Thus,  strategy  si  would  have  generated  total  utility 

4. 

With  these  two  numbers,  the  paiticipant  calculates  its  regret 
for  strategy  si.  Because  strategy  si  would  have  generated 
less  utility  (4)  than  with  its  current  strategy  (w  hich  generated 
utility  7),  strategy  sj  is  assigned  a  regret  of  0. 

Suppose,  on  the  other  hand,  that  the  participant  also 
evaluates  the  regret  assigned  to  another  strategy  s2.  For  the 
purpose  of  this  example,  suppose  that  this  strategy  would 
have  generated  utility  3  in  round  1,  utility  2  in  round  2, 
and  utility  4  in  round  3.  Its  total  utility  is  9,  which  is  larger 
than  the  utility  generated  by  the  current  strategy.  The  regret 
assigned  to  s2  the  difference  between  the  two  utility-totals: 
9-7  =  2. 

If  the  node  associates  a  positive  regret  with  a  strategy,  than 
the  node  would  have  been  ‘better  off’  adopting  that  strategy 
from  the  beginning.  To  choose  its  next  strategy,  then,  the 
node  samples  from  the  following  distribution:  Let  s  be  the 
strategy  used  in  the  previous  round.  Then: 

•  A  strategy  s'  ^  s  is  assigned  a  probability  proportional 
to  its  associated  regret.  (The  proportionality  constant  is 
chosen  at  the  beginning  of  the  game  in  such  a  way  so 
as  to  ensure  that  the  sum  of  these  probabilities  is  less 
than  one.) 

•  The  strategy  s  is  assigned  all  ‘left  over’  probability 
mass. 

To  return  to  our  example,  recall  that  the  current  strategy  of 
participant  P  is  so,  and  that  strategies  si  and  s2  have  regret 
0  and  2  assigned  to  them,  respectively.  Let  us  further  assume 

4Why  are  these  values  not  all  identical?  Because  the  other  participants 
shifted  their  strategies  from  round  to  round,  thus  changing  the  utility  of 
strategy  so- 


that  strategies  S3,  S4  and  S5  have  been  assigned  regrets  3,  4 
and  5,  respectively,  and  that  these  are  all  possible  strategies. 

To  assign  a  probability  to  each  strategy,  the  participant 
retrieves  its  proportionality  constant.  This  constant,  c,  was 
chosen  by  the  participant  at  the  beginning  of  the  game  in 
such  a  way  so  as  to  be  larger  than  any  possible  sum  of  re¬ 
grets.  With  this  constant,  the  participant  assigns  probabilities 
in  the  following  way: 

•  s  1:  probability  0. 

•  s2:  probability  2/c. 

•  S3:  probability  3/c. 

•  S4:  probability  4/c. 

•  S5:  probability  5/c. 

•  so:  all  remaining  probability  mass  (1  —  2±2±i±5) 

It  then  chooses  a  sn-ategy  for  next  round  according  to  this 
distribution. 

We  note  that  because  the  constant  c  of  this  example  is  cho¬ 
sen  to  be  large  (so  as  to  ensure  that  the  probability  assigned 
to  so  will  be  positive)  assigned  to  strategies  s2  through  S5 
may  be  quite  small.  Despite  this,  it  is  shown  (in  [4])  that 
this  procedure  will  'almost  surely’  converge  to  correlated 
equilibrium  for  airy  game.  The  specific  application  of  this 
algorithm  to  DSA  channel  selection  is  demonstrated  in  [5], 
along  with  simulation  results  demonstrating  improvements 
of  5%  to  15%  over  other  game-theoretic  results. 

C.  Our  attack 

As  mentioned  above,  this  algorithm  is  vulnerable  to  a 
simple  ‘denial  of  service’  attack  which  can  be  launched  by  a 
single  subverted  participant.  We  motivate  this  attack  through 
a  simple  example: 

Suppose  that  a  system  of  participants  has  reached  conver¬ 
gence,  and  was  been  in  convergence  for  some  time  t.  (For 
the  sake  of  simplicity,  suppose  that  the  system  has  been 
in  convergence  for  its  entire  history.)  Further  suppose  that 
during  this  time,  participant  P  has  used  the  simple  strategy 
s  of  using  only  channel  c.  By  doing  so,  it  has  received  utility 
u  per  round,  and  therefore  received  cumulative  utility  ut. 

Now,  suppose  at  time  t  that  the  adversary  ‘floods’  channel 
c.  This  means  that  participant  P  will  receive  no  utility  by 
using  channel  c.  It  should  switch  to  a  new  channel,  therefore, 
and  will.  But  when? 

Before  the  participant  will  switch,  it  needs  to  establish  that 
it  ‘regrets’  not  using  some  other  strategy  s'.  To  compute  its 
regret  for  s',  it  computes  two  values: 

•  The  cumulative  utility  it  has  received  so  far,  and 

•  The  cumulative  utility  it  would  have  received  by  purely 
following  s'  from  the  beginning  of  history  (assuming 
that  all  other  participants  remain  as  they  were). 

For  the  participant  to  compute  positive  regret  for  s',  it  must 
be  the  case  that  the  second  value  outweighs  the  first.  But 
when  will  this  be  the  case? 

Let  us  make  the  simplifying  assumption  that  strategy  s' 
will  yield  the  participant  utility  u'  (where  u'  <  u)  per  round. 


Figure  5,  Iterations  required  for  re-con*,  ergence,  «s  function  of  iterations 
previously  in  convergence 


In  this  case,  let  us  compute  the  two  values  above  as  they 
would  be  computed  by  the  participant  at  some  time  t'  ;  t: 

•  Cumulative  utility  actually  received:  the  participant  will 
have  received  utility  ut  until  time  t,  and  no  utility  after 
that.  Thus,  a  total  utility  of  ut. 

.  Cumulative  utility  from  strategy  s':  utility  u'  for  every 
round,  making  a  total  cumulative  utility  of  u't'. 

Thus,  the  participant  will  only  ‘regret’  not  using  strategy  s' 
when 

u't'  >  ut 

Thus,  the  participant  will  only  switch  when  t'  >  ^t.  If, 
for  example,  u'  —  |  u,  then  the  participant  will  only  switch 
when  t'  =  |i.  If  u'  =  then  the  participant  will  only 
switch  when  t'  =  2t.  In  general,  the  larger  the  difference 
between  u  and  u' ,  the  longer  it  will  take  the  participant  to 
switch  from  strategy  s  to  strategy  s'. 

This  example  demonstrates  a  more  general  phenomenon: 
the  longer  the  system  has  been  in  convergence,  the  longer 
participants  will  wait  to  react  to  disruption.  We  also  con¬ 
firmed  this  notion  through  simulation.  By  simulating  a  sim¬ 
ple  system  (three  users,  two  channels),  we  have  been  able  to 
demonstrate  this  intuition  experimentally.  Figure  5  displays 
the  results.  In  this  experiment,  the  original  system  was 
allowed  to  come  to  convergence  and  to  stay  in  convergence 
for  some  number  of  rounds  n.  At  round  n,  one  of  the 
participants  (the  adversary)  unilaterally  increases  their  use 
of  the  channels.  When  this  happens,  the  two  remaining 
participants  will  begin  to  receive  sub-optimal  utility.  They 
will  need  to  re- alloc  ate  the  remaining  bandwidth  between 
themselves  so  as  to  maximize  their  payouts,  which  will  cause 
the  system  to  leave  convergence.  Our  simulation  allows  the 


system  to  continue  until  the  two  honest  participants  re¬ 
converge  to  their  new,  optimal  strategies. 

Let  n'  be  the  number  of  rounds  required  to  re-converge 
(which  is  measured  from  round  n,  not  round  0).  Figure  5 
displays  n'  as  a  function  of  n.  As  can  be  seen,  the  longer 
that  a  participant  has  been  relieving  optimal  utility  from  a 
convergent  system  (n),  the  longer  it  is  willing  to  tolerate 
sub-optimal  utility  before  it  will  switch  strategies  ( n '). 

Multiple  attacks  can  leverage  this  characteristic  of  the 
regret-based  learning  system.  First,  the  ad  vers  317  can  apply 
the  simple  strategy  of  our  simulation:  simply  flooding  the 
channels  used  by  a  participant  of  interest.  It  might  also 
use  the  ‘window  of  opportunity’  to  iterate  through  all 
possible  strategies  to  find  the  most  disruptive  one.  Lastly, 
it  can  attempt  to  prevent  the  system  from  converging  or  re- 
convcrging  by  changing  its  strategies  periodically. 

D.  Our  recommendations 

'The  above  attacks  are  possible  because  of  how  algorithm- 
participants  regard  the  past.  In  particular,  when  participants 
decide  how  much ‘regret’  to  associate  with  a  particular  strat¬ 
egy,  they  compute  the  total  utility  they  would  have  received 
with  that  strategy  over  their  entire  history.  Specifically, 
the  utility  they  would  have  received  in  the  first  round  is 
considered  to  be  on  par  with  the  utility  they  would  have 
received  in  the  immediately-previous  round,  no  matter  how 
long  ago  the  first  round  had  been.  Put  another  way,  the 
participants  of  this  protocol  do  not  discount  the  past.  To 
mitigate  this  attack,  we  recommend  that  the  definition  of 
‘regret’  be  modified  to  value  the  immediate  past  more  highly 
than  the  distant  past.  This  can  be  accomplished  by  simply 
applying  a  discount  factor  which  increases  over  time,  or  to 
simply  consider  a  sliding  window  which  encompasses  only 
the  most  immediate  past. 

IV.  Conclusions 

In  this  work,  wc  considered  two  cognitive-networking 
mechanisms — Q-routing  for  packet-routing  in  MANETS, 
and  a  game-theoretic  algorithm  for  DSA — and  demonstrated 
how  both  could  be  subverted  by  a  single  malicious  insider. 
This  is  not  to  say  that  both  algorithms  are  inherently 
vulnerable;  both  algorithms  could  be  modified  to  be  resistent 
to  the  described  attacks.  (Q-routing  could  be  augmented 
with  some  sort  of  network-wide  information-distribution 
in  the  style  of  OLSR,  and  the  DSA  algorithm  could  be 
modified  to  react  more  quickly  to  disruption.)  Nor  should 
this  work  be  interpreted  as  saying  that  all  cognitive-network 
algorithms  are  vulnerable  to  insider-attacks.  However,  both 
attacks  described  in  this  paper  are  enabled  by  a  key  as¬ 
pect  of  many  cognitive-network  mechanisms:  that  network- 
participants  trust  other  network  participants  to  be  acting 
honestly.  By  doing  so,  these  cognitive-network  mechanisms 
can  more  quickly  react  to  new  information,  or  can  find  opti¬ 
mal  configurations  which  cannot  be  reached  through  purely- 


local  decisions.  However,  these  attacks  do  demonstrate  that 
cognitive-network  mechanisms  must  be  analyzed  from  a 
security  standpoint  before  being  implemented  and  deployed 
on  military  networks. 
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